ABSTRACT

Consider the string matching problem where differences between
characters of the pattern and characters of the text are allowed. Each
difference is due to either a mismatch between a character of the text and a
character of the pattern, a superfluous character in the text, or a superfluous
character in the pattern. Given a text of length n, a pattern of length m and
an integer k, we present an algorithm for finding all occurrences of the
pattern in the text, each with at most k differences. It runs in O(m + k^2 n)
time for an alphabet whose size is fixed. For general input the algorithm
requires O(m log m + k^2 n) time. In both cases the space requirement is O(m).

I.  INTRODUCTION 

In  the  known  problem  of  pattern  matching  in  strings  (e.g.,  as  discussed  in  [KMP])  we 
are  interested  in  finding  all  occurrences  of  a  pattern  in  a  text.  In  the  present  paper  we 
consider  the  following  problem: 

The string matching with k differences problem. (In short the k differences problem). Input.
Two strings: a pattern of length m and a text of length n, and an integer k ≥ 0. Find all
occurrences of the pattern in the text with at most k differences.

Example. Let the text be abcdefghi, the pattern bxdyegh and k = 3. Let us see whether there
is an occurrence with ≤ k differences that starts at the second location of the text. For this
we propose the following correspondence between bcdefghi and bxdyegh. 1. b (of the text)
corresponds to b (of the pattern). 2. c to x. 3. d to d. 4. Nothing to y. 5. e to e. 6. f to
nothing. 7. g to g. 8. h to h. The correspondence can be illustrated as follows,

b x d y e   g h
b c d   e f g h i

In  only  three  places  the  correspondence  is  between  non-equal  characters,  implying  that  there 
is  an  occurrence  of  the  pattern  at  the  second  location  of  the  text  with  3  differences,  as 
required. 

We distinguish three types of differences. (a) A character of the pattern corresponds to a
different character of the text (Item 2 in the Example). In this case we say that there is a
mismatch between the two characters. (b) A character of the pattern corresponds to "no
character" in the text (Item 4). (c) A character of the text corresponds to "no character" in
the pattern (Item 6).

The problem of string matching with k mismatches (in short the k mismatches problem)
was considered in [I] and [LV]. In the k mismatches problem we find all occurrences of the
pattern in the text with at most k differences of type (a). Differences of types (b) or (c) are
not allowed. [LV] give an O(k m log m + k n) algorithm for the problem. [I] gives a linear
time algorithm for fixed k. However, its running time grows very rapidly as a function of k.

The k differences problem has a strong pragmatic flavor. In practice, we often need to
analyze situations where the data is not completely reliable. Specifically, consider a situation
where the strings which are the input for our problem contain errors and we still need to find
all possible occurrences of the pattern in the text as in reality. The errors may include a
character being replaced by another character, a character being omitted or a superfluous
character being inserted. Assuming some bound on the number of errors clearly leads to
our problem. Applications of our solution for the k differences problem in Molecular
Biology are discussed in [LVN]. [SK] give a comprehensive review of applications of the k
differences problem.

We give a new algorithm for the problem. The algorithm has two implementations.
The first runs in O(m^2 + k^2 n) time and O(m^2) space for general input. The second runs in
O(m + k^2 n) time and O(m) space for an alphabet whose size is fixed. For general input the
second implementation needs O(m log m + k^2 n) time and the same space. The algorithm is
designed for a random-access-machine (RAM) [AHU].

Our algorithm consists of a pattern analysis part to be followed by a text analysis.
In the first implementation we have achieved O(m^2) time for the pattern analysis and
O(k^2 n) time for the text analysis. In the second one we have achieved O(m) time for the
pattern analysis for an alphabet whose size is fixed and O(m log m) time for general input,
while for the text analysis we achieved O(k^2 n) time.

Besides its relative simplicity, our first implementation is not without merit even with respect
to the second implementation. Specifically, whenever the time for the text analysis
dominates the computation time, the running time of our first implementation is close to
optimal. There are a few realistic possibilities where this happens:

1. The same pattern has to be matched with different texts or the pattern is known in advance
and we have plenty of time to analyze it.

2. m is sufficiently small with respect to n. Specifically, m = o(k√n).

Perhaps surprisingly we were able to adopt the conservative and simple framework of
[KMP] in the algorithm. That is, we first build a table based on analysis of the pattern.
Then, we examine the text from left to right checking possible occurrences with respect to
one starting location (in the text) at each iteration. Besides the tables built in the pattern
analysis, the input to each iteration consists of the knowledge acquired in previous iterations.
The rightmost location in the text to which we arrived in a previous iteration is of particular
significance. Each iteration consists of manipulating this knowledge and, if necessary (that
is, if up to this rightmost location we do not have sufficient evidence to exclude the
possibility of an occurrence), proceeding to investigate the text to the right of this rightmost
location.

We use a further analogy to the [KMP] algorithm for the known string matching problem in
order to explain our contribution in the algorithm for the k differences problem. Consider
the following most trivial strings equality problem: Given two strings, find if they are
identical. Observing this problem and its immediate solution was clearly a step (even if it
was very minor) in devising string matching algorithms. Our presentation is strongly
motivated by this analogy in the following sense. The presentation has two major steps.
In the first step we define an auxiliary problem which is analogous to the strings equality
problem when the k differences problem is considered (instead of the string matching
problem). The algorithm for the auxiliary problem uses known techniques ([U]). Our
contribution is in providing the second major step. That is, we give an algorithm for the k
differences problem using the algorithm for the first step. The auxiliary problem of the first
major step is less obvious than the strings equality problem and provides an essential part of
our algorithm. However, we feel that our contribution, which is analogous to the whole
algorithm of [KMP], overcomes a more involved and general problem than in [KMP].

For the original string matching problem and the k mismatches problem, the definitions
of the problems immediately imply algorithms which run in O(mn) time. [S] gives a simple
O(nm) time algorithm for the k differences problem. In his survey on future directions for
research in string matching, Z. Galil [G85] discusses the k mismatches problem, which seems
substantially easier than our present problem. He states that it is an open question whether
the naive algorithm for the k mismatches problem which takes O(mn) time can be improved.
Note that the bound on the running time of our algorithm is much smaller than O(nm). We
already noted that the case k = 0 is the extensively studied string matching problem. There
are a few notable algorithms for the string matching problem: linear time serial algorithms -
[BM], [GS], [KMP], [KR] (a randomized algorithm) and [V], parallel algorithms - [G84] and
[V]. Note that none of these algorithms is suitable to cope with our problems.

[U85] presents an interesting algorithm for the k differences problem whose pattern
analysis takes exponential time. The algorithm runs in time O(m|Σ|G + n) and requires
O((|Σ| + m)G) space, where |Σ| is the size of the alphabet and G = min(3^m, 2^k |Σ|^k m^{k+1}).
Preprocessing of the pattern takes O(m|Σ|G) time. Then analysis of the text takes O(n) time,
which is pretty impressive. However, the author himself seems to be aware that the space
and preprocessing time requirements make the algorithm impractical, in general. (For
comparison, the space requirement of our first implementation is O(m^2) and of the second is
O(m)).

Section 2 gives an exceedingly simple algorithm for string matching with one difference.
Section 3 presents the pattern analysis part of the algorithm for the k differences problem.
Section  4  presents  the  text  analysis  part.  Both  sections  3  and  4  discuss  the  first 
implementation  of  the  algorithm  only.  The  second  implementation  is  given  in  section  5. 


II.  PATTERN  MATCHING  WITH  1  DIFFERENCE  IN  LINEAR  TIME 

Below,  we  describe  a  simple  algorithm  for  finding  all  occurrences  of  the  pattern  in  the 
text  with  at  most  one  difference.  [ML]  showed  how  to  apply  the  [KMP]  algorithm  for  the 
following  slightly  modified  string  matching  problem.  Find  for  each  location  of  the  text 
whether  an  occurrence  of  the  pattern  starts  at  it.  For  each  location  in  which  there  is  no 
occurrence  of  the  pattern  find  the  leftmost  character  in  which  there  is  a  mismatch. 
The  algorithm  for  pattern  matching  with  1  difference  has  three  steps. 

1. Run this modified string matching algorithm on the input text and pattern. Suppose that no
occurrence of the pattern starts at character t_i of the text. Denote by l(i) the leftmost
mismatched character of text for the pattern starting at t_i. The modified string matching
algorithm would find l(i).

2. Reverse the text (if it was t_1,...,t_n let it be t_n,t_{n-1},...,t_1) and the pattern and run
the same algorithm. Suppose that no occurrence of the pattern starts at location t_i of the text.
Denote by r(i) the rightmost mismatched character of text for the pattern starting at t_i. The
modified string matching algorithm would find r(i).

Recall the three types of differences as defined in the Introduction.

3. Suppose that no occurrence of the pattern starts at location t_i of the text.

(i) For all i such that l(i) = r(i), we can conclude that there is an occurrence of the pattern
starting at t_i with one mismatch (difference of type (a)).

(ii) For all i such that l(i) > r(i-1), we can conclude that there is an occurrence of the
pattern starting at t_i with at most one difference of type (b).

(iii) For all i such that l(i) ≥ r(i+1), we can conclude that there is an occurrence of the
pattern starting at t_i with at most one difference of type (c).

Remarks.  1.  The  running  time  of  this  algorithm  is  0(n  +  m).  2.  We  leave  it  to  the 
interested  reader  to  find  how  to  generalize  this  algorithm  for  a  wider  definition  of  this 
problem  where  one  "successive  chunk  of  differences"  is  allowed. 
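
The three tests of step 3 can be sketched in code. The sketch below is an illustration under stated assumptions, not the authors' implementation: the name one_difference_starts is hypothetical, positions are 0-based, and the prefix and suffix match lengths (which play the roles of l(i) and r(i)) are computed by direct comparison, so the sketch runs in O(nm) time rather than the O(n + m) obtained with the modified [KMP] algorithm of [ML]. A start is reported when some substring beginning there is within one difference of the pattern.

```python
def one_difference_starts(pattern: str, text: str) -> list[int]:
    """0-based starts i such that some text[i:end] is within 1 difference of pattern."""
    m, n = len(pattern), len(text)

    def prefix_match(i: int) -> int:
        # Longest common prefix of pattern and text[i:] (role of l(i)).
        x = 0
        while x < m and i + x < n and pattern[x] == text[i + x]:
            x += 1
        return x

    def suffix_match(e: int) -> int:
        # Longest common suffix of pattern and text[:e + 1] (role of r(.)).
        y = 0
        while y < m and e - y >= 0 and pattern[m - 1 - y] == text[e - y]:
            y += 1
        return y

    starts = []
    for i in range(n):
        x = prefix_match(i)
        # (index of the last text character used, matched characters required)
        candidates = [
            (i + m - 1, m - 1),  # type (a): one mismatch, or an exact occurrence
            (i + m - 2, m - 1),  # type (b): one pattern character has no partner
            (i + m, m),          # type (c): one text character has no partner
        ]
        if any(i - 1 <= e < n and x + suffix_match(e) >= need
               for e, need in candidates):
            starts.append(i)
    return starts
```

For the pattern abc and the text zzabczz the sketch reports the starts 1 (zabc, type (c)), 2 (exact) and 3 (bc, type (b)).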


III. STRING MATCHING WITH k DIFFERENCES - PATTERN ANALYSIS (first
implementation).

The  algorithm  has  two  parts: 

(a)  The  pattern  analysis.  We  build  a  table  which  is  based  on  analysis  of  the  pattern. 

(b)  The  text  analysis.  We  show  how  to  find  efficiently  all  the  occurrences  of  the  pattern  in 
the  text  with  at  most  k  differences,  using  the  result  of  the  pattern  analysis. 

Analysis  of  the  pattern 

The input to the pattern analysis is the pattern, which is given as an array
A = a_1,...,a_m. The output of the pattern analysis is a two dimensional array
MAX-LENGTH[0,...,m-1; 0,...,m-1]. MAX-LENGTH(i,j) = f means that
a_{i+1},...,a_{i+f} = a_{j+1},...,a_{j+f} and a_{i+f+1} ≠ a_{j+f+1}. In words, consider laying the
suffix of the pattern starting at a_{i+1} over the suffix of the pattern starting at a_{j+1}.
MAX-LENGTH(i,j) is the longest match of prefixes between these two suffixes.
Define the pair (i,j), 0 ≤ i,j ≤ m-1, to be on diagonal d if i-j = d, where possible
values of d are -(m-1) ≤ d ≤ m-1. We compute each diagonal d of MAX-LENGTH
separately. Below, we give the algorithm for any diagonal d, m-1 ≥ d ≥ 0. The algorithm
for diagonals d where -(m-1) ≤ d < 0 is similar. Note, however, that a trivial change in
the initialization step is required.

Initialization.

MAX-LENGTH(m-1, m-1-d) := 1    if a_m = a_{m-d}

MAX-LENGTH(m-1, m-1-d) := 0    otherwise

The computation proceeds along decreasing indices of diagonal d,

MAX-LENGTH(i, i-d) := 1 + MAX-LENGTH(i+1, i-d+1)    if a_{i+1} = a_{i-d+1}

MAX-LENGTH(i, i-d) := 0    otherwise

Complexity. The number of operations performed by the algorithm for each diagonal is
proportional to the number of pairs on the diagonal. Therefore, the total number of
operations performed by the algorithm for all diagonals is proportional to the total
number of pairs, which is O(m^2).
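
As an illustration, the recurrence can be sketched as follows. The function name max_length_table is hypothetical, the pattern is a 0-based Python string (so pattern[i] stands for a_{i+1}), and the table is filled row by row from the bottom rather than diagonal by diagonal, which evaluates the same recurrence. An extra row and column of zeros stands for positions past the end of the pattern, so no separate initialization step is needed.

```python
def max_length_table(pattern: str) -> list[list[int]]:
    """table[i][j] = length of the longest common prefix of the suffixes
    a_{i+1}... and a_{j+1}... of the pattern (0 <= i, j <= m-1)."""
    m = len(pattern)
    # Padding row/column m holds zeros: nothing matches past the pattern end.
    table = [[0] * (m + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if pattern[i] == pattern[j]:
                # Entry (i, j) extends entry (i+1, j+1) on the same diagonal.
                table[i][j] = 1 + table[i + 1][j + 1]
    # Drop the padding before returning.
    return [row[:m] for row in table[:m]]
```

For the pattern abab, entry (0,2) is 2 because the suffixes abab and ab share the prefix ab, and entry (0,1) is 0.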


IV.  ANALYSIS  OF  THE  TEXT  (first  Implementation) 

The  input  to  the  text  analysis  consists  of  the  following: 

a) The pattern. An array A = a_1,...,a_m.

b) The text. An array T = t_1,...,t_n.

c) An integer k ≥ 0.

d) The output of the pattern analysis: The two dimensional array MAX-LENGTH.

Output of the text analysis: All occurrences with ≤ k differences of the pattern in the text.
The  description  of  the  text  analysis  uses  a  known  algorithm  for  an  auxiliary  problem.  The 
relation  between  the  auxiliary  problem  and  the  text  analysis  is  discussed  briefly  in  the 
introduction.    Since  we  wanted  this  presentation  to  be  self  contained  we  describe  this 
algorithm. 

IV.1 The auxiliary problem

Input. Two strings: A = a_1,...,a_m and B = b_1,...,b_{m+k}. We want to find whether an
occurrence of A with at most k differences starts at b_1. We first show an O(m^2) time
algorithm for the auxiliary problem. Later, we show how to derive from it an O(km) time
algorithm.


O(m^2) time algorithm for the auxiliary problem. We use a matrix D[0,...,m; 0,...,m+k],
where D_{i,l} is the minimal number of differences between a_1,...,a_i and b_1,...,b_l.
It should be obvious that if D_{m,l} ≤ k for at least one l, m-k ≤ l ≤ m+k, then the answer to
the auxiliary problem is yes.

The following algorithm computes the matrix D[0,...,m; 0,...,m+k].

Initialization      D_{0,0} := 0.

for all l, 1 ≤ l ≤ m+k, D_{0,l} := l.
for all i, 1 ≤ i ≤ m, D_{i,0} := i.

for i := 1 to m do
    for l := 1 to m+k do
        D_{i,l} := min (D_{i-1,l}+1, D_{i,l-1}+1, D_{i-1,l-1} if a_i = b_l or D_{i-1,l-1}+1 otherwise).

(D_{i,l} is the minimum of three numbers. These numbers are obtained from the
predecessors of D_{i,l} on its column, row and diagonal, respectively).

This algorithm clearly runs in O(m^2) time.
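
A minimal sketch of this matrix computation is given below. The names difference_matrix and occurs_at_start are hypothetical, strings are 0-based (a[i-1] stands for a_i), and B may have any length rather than exactly m+k.

```python
def difference_matrix(a: str, b: str) -> list[list[int]]:
    """D[i][l] = minimal number of differences between a_1..a_i and b_1..b_l."""
    m, width = len(a), len(b)
    D = [[0] * (width + 1) for _ in range(m + 1)]
    for l in range(1, width + 1):
        D[0][l] = l          # l text characters with no partner
    for i in range(1, m + 1):
        D[i][0] = i          # i pattern characters with no partner
    for i in range(1, m + 1):
        for l in range(1, width + 1):
            diagonal = D[i - 1][l - 1] + (0 if a[i - 1] == b[l - 1] else 1)
            D[i][l] = min(D[i - 1][l] + 1, D[i][l - 1] + 1, diagonal)
    return D


def occurs_at_start(a: str, b: str, k: int) -> bool:
    """True if some prefix of b is within k differences of a."""
    return min(difference_matrix(a, b)[len(a)]) <= k
```

With the strings of the Example in the Introduction, occurs_at_start("bxdyegh", "bcdefghi", 3) holds, while the same call with k = 2 does not.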

O(km) time algorithm for the auxiliary problem. (Due to [U]).
Diagonal d of the matrix consists of all D_{i,l}'s such that l-i = d.
Lemma 1 [U]. For every i,l, D_{i,l} - D_{i-1,l-1} is either zero or one.
Lemma 1 implies that we can store the information of the matrix in a more compact way.
For a number of differences e and a diagonal d, let L_{d,e} denote the largest row i such that
D_{i,l} = e and D_{i,l} is on diagonal d. Note that this implies that there are e differences
between a_1,...,a_{L_{d,e}} and b_1,...,b_{L_{d,e}+d}, and a_{L_{d,e}+1} ≠ b_{L_{d,e}+d+1}.

Corollary. For our auxiliary problem we need only values of L_{d,e} where e and d satisfy
e ≤ k and |d| ≤ e.

Proof. e ≤ k is obvious. The initial values of the matrix and Lemma 1 imply that all the D_{i,l}
on a diagonal d are ≥ |d| and therefore given a number of differences e we need only values
of L_{d,e} where |d| ≤ e.

The answer to the auxiliary problem is yes if one of the L_{d,e} (|d| ≤ e ≤ k) equals m.

Given d and e we describe how to compute L_{d,e} using its definition. That is, L_{d,e} is the
largest row such that D_{i,l} = e and D_{i,l} is on the diagonal d. In the above O(m^2) time
algorithm the assignment of e into D_{i,l} was done using one (or more) of the following data:

(a) D_{i-1,l-1} (the predecessor of D_{i,l} on the diagonal d) is e-1 and a_i ≠ b_l.

(b) D_{i,l-1} (the predecessor of D_{i,l} on row i which is also on the diagonal "below" d) is e-1.

(c) D_{i-1,l} (the predecessor of D_{i,l} on column l which is also on the diagonal "above" d) is
e-1.

(d) D_{i-1,l-1} is also e and a_i = b_l.

This implies that we can start from D_{i,l} and follow its predecessors on diagonal d by
possibility (d) till the first time one (or more) of possibilities (a), (b) and (c) occur.
The O(km) time algorithm for the auxiliary problem is given below. The reader is invited to
convince oneself that the initialization step (Instruction 1) is done in a way which enables the
computation of the L_{d,e}'s in the subsequent instructions. Instructions 2-6 compute L_{d,e}'s
(|d| ≤ e ≤ k) by "inversing" the above description for computing the L_{d,e}'s from their
definition. L_{d,e-1}, L_{d-1,e-1} and L_{d+1,e-1} are used to initialize the variable row (Instruction
3), which is then increased by one at a time till it hits the correct value of L_{d,e} (Instruction 4).

The O(km) time algorithm for the auxiliary problem

[1] Initialization
    for d := -(k+1) to (k+1) do
        L_{d,|d|-2} := -∞;
        if d < 0
            then L_{d,|d|-1} := |d|-1;
            else L_{d,|d|-1} := -1;

[2] for e := 0 to k do
        for d := -e to e do
[3]         row := max[(L_{d,e-1}+1), (L_{d-1,e-1}), (L_{d+1,e-1}+1)].
[4]         while a_{row+1} = b_{row+1+d} do
                row := row+1.
[5]         L_{d,e} := row.
[6]         if L_{d,e} = m
                then print 'YES' and stop.

Complexity. The algorithm computes L_{d,e} for 2k+1 diagonals. For each diagonal
variable row can get at most m different values. Therefore, the computation takes O(km)
time.
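
Instructions 1-6 can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: the name has_occurrence_within_k is hypothetical, the L_{d,e} values are kept in a dictionary keyed by (d, e), strings are 0-based, and row is explicitly capped so that it never leaves the matrix (a boundary check that the pseudocode above leaves implicit).

```python
def has_occurrence_within_k(a: str, b: str, k: int) -> bool:
    """True if some prefix of b is within k differences of a (diagonal method)."""
    m, nb = len(a), len(b)
    neg_inf = float("-inf")
    L = {}
    # [1] Initialization of the boundary values.
    for d in range(-(k + 1), k + 2):
        L[(d, abs(d) - 2)] = neg_inf
        L[(d, abs(d) - 1)] = abs(d) - 1 if d < 0 else -1
    # [2] Process the number of differences e in increasing order.
    for e in range(k + 1):
        for d in range(-e, e + 1):
            # [3] Best row reachable with one more difference.
            row = max(L[(d, e - 1)] + 1, L[(d - 1, e - 1)], L[(d + 1, e - 1)] + 1)
            # Assumption of this sketch: keep (row, row + d) inside the matrix.
            row = min(row, m, nb - d)
            # [4] Slide down diagonal d while characters match
            #     (a[row] is a_{row+1}, b[row + d] is b_{row+1+d}).
            while row < m and row + d < nb and a[row] == b[row + d]:
                row += 1
            # [5]
            L[(d, e)] = row
            # [6]
            if row == m:
                return True
    return False
```

For the strings of the Example in the Introduction the sketch answers yes for k = 3 and no for k = 2, in agreement with the matrix computation.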


IV.  2  Back  to  the  text  analysis 

Overview of the text analysis. Let us go back to the text analysis. The text analysis consists
of n-m+k iterations. At iteration i we check if an occurrence with ≤ k differences of the
pattern starts at t_{i+1}. Let t_j be the rightmost symbol in the text that was reached at an
iteration prior to i. Assume, w.l.o.g., that we reached t_j for the first time at iteration r,
0 ≤ r < i.

Example 1. Let t_17,...,t_30 be abaaaeddaedcab, a_1,...,a_13 be aaaaeddedebab and
k = 4. Suppose r = 16 and j = 30. The correspondence

1  2  3  4  5  6  7  8  9  10  11  12  13 
a  a  a  a  e  d  d  c  d  c  b  a  b 
abaaaeddaede  a  b 
17  18  19  20  21  22  23  24  25  26  27  28      29  30 

gives k differences. It can be easily checked that a correspondence with fewer differences is
impossible.

The definition of the text analysis (given later) implies that there are at most k+1 differences
between t_{r+1},...,t_j and some prefix of the pattern. Hence, there are also at most k+1
differences between some suffix of this prefix of the pattern and t_{i+1},...,t_j. We call this
suffix of prefix of the pattern the subpattern. In the example let i be 20 and the subpattern
will be a_4,...,a_13. This means that for some correspondence between symbols of
t_{i+1},...,t_j and the subpattern there are at least j-i-k symbols of t_{i+1},...,t_j that have a
match in the subpattern. It is easy to see that all the symbols of the text that have successive
matches in this correspondence form at most k+1 (successive) substrings in t_{i+1},...,t_j. For
each such substring we know its corresponding substring in the pattern. Suppose a substring
of the text t_{p+1},...,t_{p+f} matches a substring of the pattern a_{c+1},...,a_{c+f} and
t_{p+f+1} ≠ a_{c+f+1}; we denote this by the triple (p,c,f). There are at most k symbols in
t_{i+1},...,t_j which do not have matching symbols in the subpattern. We denote each such
symbol t_{h+1} by the triple (h,0,0). We denote the sequence of these triples by S_{i,j}.
In Example 1, S_{20,30} is (20,3,1), (21,0,0), (22,5,2), (24,0,0), (25,7,3), (28,11,2).

Recall that at iteration i we want to find if a_1,...,a_m occurs at t_{i+1},... . This seems
similar to the auxiliary problem. However, here we have more information: the sequence S_{i,j}
and MAX-LENGTH. Iteration i uses this information.

Iteration i.

Iteration i consists of the O(km) time algorithm of the auxiliary problem with the
following modification in Instruction 4. Instruction 4 increases the variable row by one at a
time. The sequence S_{i,j} and MAX-LENGTH enable us to increase row by much larger jumps as
long as we do not require information about symbols of the text which are beyond t_j. Once
row takes us beyond t_j (i.e., i+row+1+d > j), S_{i,j} and MAX-LENGTH do not help us
any more and we apply (the old) Instruction 4 as in the computation of the auxiliary
problem.

We finish this overview of iteration i by showing how to apply the sequence S_{i,j} and
MAX-LENGTH to obtain these jumps. The while loop of Instruction 4 looks for the longest
match between prefixes of some suffix of the text t_{i+row+d+1},..., where
i+1 ≤ i+row+d+1 ≤ j, and some suffix of the pattern a_{row+1},... . We explain how to find
the maximum w such that a_{row+1},...,a_{row+w} equals t_{i+row+d+1},...,t_{i+row+d+w}. Suppose
that according to S_{i,j} the substring t_{i+row+d+1},...,t_{i+row+d+f} matches a_{c+1},...,a_{c+f} for
some index c of the pattern and c+f is the maximal index of the pattern for which this match
holds. We can find those c and f using the fact that for each t_{h+1} (i ≤ h < j) there exists a
triple (p_1,c_1,f_1) in S_{i,j} such that p_1 ≤ h ≤ p_1+f_1. ((p_1,c_1,f_1) covers t_{h+1}). For the
computation of w we need to break into the following cases:

Case (a). f ≥ 1. MAX-LENGTH(c,row) gives the maximal number g such that
a_{c+1},...,a_{c+g} equals a_{row+1},...,a_{row+g}. Case (a) has two subcases.

Case (a1). f ≠ g. It is easy to see that here t_{i+row+d+1},...,t_{i+row+d+min(f,g)} =
a_{row+1},...,a_{row+min(f,g)} and t_{i+row+d+min(f,g)+1} ≠ a_{row+min(f,g)+1}. Therefore, we assign
w := min(f,g).

Case (a2). f = g. This implies t_{i+row+d+1},...,t_{i+row+d+f} = a_{row+1},...,a_{row+f} but does not
reveal whether t_{i+row+d+f+1} equals a_{row+f+1} or not. Therefore, we assign row := row+f and
apply again the present case analysis accumulating this "jump" over f symbols into w.

Case (b). f = 0. Case (b) has two subcases.

Case (b1). t_{i+row+d+1} ≠ a_{row+1}. Hence, we assign w := 0.

Case (b2). t_{i+row+d+1} = a_{row+1}. Therefore, we assign row := row+1, and we apply again the
present case analysis accumulating this propagation of 1 into w.
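
The case analysis can be sketched in code as follows. All names (max_length_table, covering_triple, jump_with_triples) are hypothetical, and the sketch makes simplifying assumptions: triples are looked up by a linear scan instead of by remembering the last triple examined, a match triple (p,c,f) is taken to cover t_{h+1} when p ≤ h < p+f, an unmatched-symbol triple (h,0,0) covers t_{h+1} only, and the text and the pattern are 0-based strings (text[h] stands for t_{h+1}). The second returned value tells whether a mismatch has already been established (cases a1 and b1), in which case the character-by-character loop is not needed.

```python
def max_length_table(pattern):
    """table[i][j] = longest common prefix of suffixes a_{i+1}... and a_{j+1}..."""
    m = len(pattern)
    table = [[0] * (m + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if pattern[i] == pattern[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
    return table


def covering_triple(S, h):
    """Return (c, f) with t_{h+1},...,t_{h+f} = a_{c+1},...,a_{c+f},
    or (0, 0) if t_{h+1} has no matching symbol in the subpattern."""
    for p, c, f in S:
        if f == 0 and p == h:
            return 0, 0
        if p <= h < p + f:
            # Skip the part of the triple that lies before t_{h+1}.
            return c + (h - p), f - (h - p)
    raise ValueError("no triple covers this text position")


def jump_with_triples(row, d, i, j, S, table, pattern, text):
    """Advance row using S and MAX-LENGTH while t_{i+row+d+1} is not beyond t_j."""
    m = len(pattern)
    while i + row + d + 1 <= j and row < m:
        c, f = covering_triple(S, i + row + d)
        if f >= 1:                              # case (a)
            g = table[c][row]
            if f != g:                          # case (a1): mismatch located
                return row + min(f, g), True
            row += f                            # case (a2): jump over f symbols
        elif text[i + row + d] != pattern[row]:
            return row, True                    # case (b1)
        else:
            row += 1                            # case (b2)
    return row, False                           # continue with the old Instruction 4
```

For instance, with the pattern abab, the text ababab and the sequence (0,0,4) left by an exact match found at iteration 0 (so j = 4), iteration 2 jumps on diagonal 0 from row 0 to row 2 in a single step.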

The text analysis algorithm

j := 0;
for i := 0 to n-m+k do
begin
[1]  Proper initialization (as in the O(km) algorithm for the auxiliary problem)
[2]  for e := 0 to k do
       for d := -e to e do
[3]      row := max[(L_{d,e-1}+1), (L_{d-1,e-1}), (L_{d+1,e-1}+1)].
[4.new]  while i+row+d+1 ≤ j do
[4.new.1]    take from S_{i,j} the triple that "covers" t_{i+row+d+1}. Derive from
             this triple the indices c,f such that t_{i+row+d+1},...,t_{i+row+d+f}
             = a_{c+1},...,a_{c+f} (t_{i+row+d+f+1} ≠ a_{c+f+1})
[4.new.2]    if f ≥ 1
             then (* case a *)
[4.new.3]        if f ≠ MAX-LENGTH(c,row)
                 then (* case a1 *)
                     row := row + min(f, MAX-LENGTH(c,row))
                     go to 5
                 else (* case a2 *)
                     row := row + f;
             else (* case b *)
[4.new.4]        if t_{i+row+d+1} ≠ a_{row+1}
                 then (* case b1 *)
                     go to 5
                 else (* case b2 *)
                     row := row+1
         od
[4.old]  while a_{row+1} = t_{i+row+1+d} do
             row := row+1.
[5]      L_{d,e} := row.
[6]      if L_{d,e} = m
             then print 'YES' and go to 7
       od
[7]  If new symbols of the text were reached (j was increased), then starting from the
     L_{d,k} (which implies the new j, i.e. j = L_{d,k}+d+i) or from the L_{d,e} (which is equal to
     m, in case a match with ≤ k differences has been found) we reconstruct the new S_{i,j}.
end

Implementation remarks.

Instruction 4.new.1: When we compute L_{0,0} we start searching for the indices c and f at the
first triple of S_{i,j}. We know which triple was checked when any L_{d,e} gets its value. So,
when computing a new L_{d,e} we know what were the last triples we checked in the
computation of each one of L_{d-1,e-1}, L_{d,e-1}, L_{d+1,e-1}. At Instruction 3 row got its initial
value from the maximum of L_{d-1,e-1}, L_{d,e-1}+1, L_{d+1,e-1}+1. The last triple that was checked
in the computation of the one which gives this maximum is the first to be checked in the
computation of L_{d,e}.

Instruction 7: At the end of each iteration i, if at least one new symbol of the text was reached
we have to create a new sequence of triples instead of S_{i,j}. We show how to do it. If t_{j'} is the
rightmost symbol of the text which was reached in such an iteration then denote the new
sequence S_{i,j'}. In order to compute S_{i,j'} we hold a sequence (of triples) for each L_{d,e} during
its computation at iteration i. This sequence "realizes" L_{d,e}. That is, it gives a correspondence
between a_1,...,a_{L_{d,e}} and t_{i+1},...,t_{i+L_{d,e}+d} with exactly e differences.
At the beginning of each iteration i each L_{d,e} has an empty such sequence. We use again the
fact that initially (at Instruction 3) row is the maximum among L_{d-1,e-1}, L_{d,e-1}+1, and
L_{d+1,e-1}+1 and finally (at Instruction 5) is L_{d,e}. Assume that we know the sequences of the
predecessors of L_{d,e} (namely, the sequences of L_{d-1,e-1}, L_{d,e-1} and L_{d+1,e-1}). We get the
sequence of L_{d,e} by adding triples to the end of the sequence of the predecessor which gives
the maximum in initializing row. Let r_1 be the initial value of row. If r_1 got its value from
L_{d-1,e-1} (or L_{d,e-1}) then we add to the sequence the triple (i+r_1+d-1, 0, 0). (Meaning that
for t_{i+r_1+d} there is no corresponding symbol in the pattern). Following Instruction 5, if
L_{d,e} > r_1, we next add the triple (i+r_1+d, r_1, L_{d,e}-r_1) to the sequence of L_{d,e}. This is done
regardless of whether the source of r_1 was L_{d-1,e-1} or L_{d,e-1} or L_{d+1,e-1}. (This triple
describes the match between substrings of the pattern and the text which was found during
the computation of L_{d,e} given L_{d-1,e-1}, L_{d,e-1} and L_{d+1,e-1}).

At the end of iteration i we check which of the 2k+1 sequences reached the rightmost symbol
of the text. If the index of this symbol is greater than j (L_{d,e}+d+i > j), then we take its
sequence to be the new S_{i,j}.

Complexity. The old Instruction 4 (where row is increased by one at a time without
using S_{i,j} and MAX-LENGTH) is employed each time we move to a new symbol of the text.
We maintain O(k) diagonals at any time during the text analysis and may need to compare
the new symbol for each of them. Hence, the old Instruction 4 requires a total of O(kn) time
throughout the text analysis. In order to evaluate the number of steps which are required by
the new Instruction 4 at iteration i, we use again the fact that O(k) diagonals are computed.
The sequence S_{i,j} has at most 2k+1 triples. We can charge each operation performed on any
one of the diagonals to either a difference being discovered (there are ≤ k such differences),
or to a triple of S_{i,j} being examined (there are ≤ 2k+1 triples). This amounts to O(k)
operations per diagonal at each iteration i. Therefore, the total running time of the text
analysis is O(k^2 n).


V. STRING MATCHING WITH k DIFFERENCES (second Implementation)

The  second  implementation  of  the  k  differences  algorithm  is  similar  to  the  first  one.  In 
this  section,  we  describe  only  the  changes  with  respect  to  the  first  implementation. 

Analysis  of  the  pattern 

Input.  A  pattern  alt  .  .  .  ,am 

We  define  the  suffixes  tree  as  follows: 

1)  All  the  edges  of  the  tree  are  directed  away  from  tfle  root.  The  out  degree  of  each  node  of 
the  tree  is  either  zero  (if  the  node  is  a  leaf)  or  2:  2. 

2) Each edge of the tree corresponds to some successive substring of the pattern
a_i, a_{i+1}, ..., a_j, where 1 ≤ i ≤ j ≤ m. For each node v of the tree there is a directed path
from the root to v. Concatenating all substrings which correspond to edges along this path
yields a string which corresponds to v.

3)  The  tree  has  m  leaves,  each  corresponding  to  a  different  suffix  of  the  pattern. 

Remark. Up to isomorphism (of graphs) there is only one suffixes tree for a given pattern.

Example. Given the pattern abab$, the suffixes tree is:


[Figure: (a) The suffixes tree. (b) The output of the pattern analysis.]
The output of the pattern analysis should include the suffixes tree of the pattern.
Specifically, each node v of the tree (besides the root) will have: 1. A pointer to its father
(the node on the other side of its single incoming edge in the tree). 2. The number of
characters in the string which corresponds to the path from the root to v.
Note that the number of characters in the string which corresponds to the path from the root
to a leaf readily implies which suffix corresponds to this leaf: a leaf whose string has d
characters corresponds to the suffix a_{m-d+1}, ..., a_m.
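
As an illustration of this output, the following is a minimal sketch that builds the suffixes tree by naive insertion of one suffix at a time. It takes O(m^2) time and is not the linear-time algorithm of [W]; the names Node and build_suffixes_tree are hypothetical, and the pattern is assumed to end in a sentinel (such as $) that occurs nowhere else. Each node stores exactly the two items listed above, a father pointer and the number of characters on the path from the root.

```python
class Node:
    def __init__(self, parent, depth):
        self.parent = parent    # father pointer (None only for the root)
        self.depth = depth      # characters on the path from the root to this node
        self.children = {}      # first edge character -> (start, end, child)


def build_suffixes_tree(p):
    """Naive O(m^2) build; p must end in a sentinel that occurs nowhere else."""
    m = len(p)
    root = Node(None, 0)
    leaves = []                 # leaves[s] is the leaf of the suffix p[s:]
    for s in range(m):
        node, pos = root, s
        while True:
            c = p[pos]
            if c not in node.children:
                # No edge starts with c: hang a new leaf for this suffix here.
                leaf = Node(node, m - s)
                node.children[c] = (pos, m, leaf)
                leaves.append(leaf)
                break
            start, end, child = node.children[c]
            l = 0               # matched characters along the edge label p[start:end]
            while l < end - start and p[start + l] == p[pos + l]:
                l += 1
            if l < end - start:
                # Mismatch inside the edge: split it at a new internal node.
                mid = Node(node, node.depth + l)
                node.children[c] = (start, start + l, mid)
                mid.children[p[start + l]] = (start + l, end, child)
                child.parent = mid
                child = mid
            node, pos = child, pos + l
    return root, leaves


pattern = "abab$"
root, leaves = build_suffixes_tree(pattern)
for s, leaf in enumerate(leaves):
    # A leaf of depth d corresponds to the suffix that starts at 0-based position m - d.
    print("suffix", pattern[s:], "depth", leaf.depth, "father depth", leaf.parent.depth)
```

For abab$ the leaves of abab$ and ab$ share a father of depth 2 (the string ab), and the leaves of bab$ and b$ share a father of depth 1 (the string b).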

Complexity. We refer the reader to [W] for computation of the suffixes tree in O(m)
time when the size of the alphabet is fixed. If the alphabet of the pattern contains x letters,
then it is easy to adapt the algorithm of [W] and the whole pattern analysis to run in time
O(m log x). In both cases the space requirement of the pattern analysis is O(m). (The reader
is also referred to [CS], in which a lucid presentation of the algorithm of [W] is given.)


Analysis of the text

Recall that the array MAX-LENGTH, which was computed in the pattern analysis of the
first implementation, held the following information:
MAX-LENGTH(i, j) = f meant that a_{i+1}, ..., a_{i+f} is equal to a_{j+1}, ..., a_{j+f} and
a_{i+f+1} ≠ a_{j+f+1}. The text analysis of our algorithm remains essentially unchanged in this
implementation. However, the information that appeared in MAX-LENGTH is not as readily
available now as before. The idea is to treat each point at which MAX-LENGTH(i, j) is needed as a
query which has to be satisfied. So, it remains to show how to satisfy the query
MAX-LENGTH(i, j) using the output of the pattern analysis. Let Y_{i,j} be the lowest common
ancestor (in short LCA) of the leaves corresponding to the suffixes a_{i+1}, ..., a_m and
a_{j+1}, ..., a_m in the suffixes tree. Observe that the path from the root to Y_{i,j} corresponds to
a string of the form a_{i+1}, ..., a_{i+f} (or a_{j+1}, ..., a_{j+f}), and a_{i+f+1} ≠ a_{j+f+1}. But f is the
number of characters in the string corresponding to the path from the root to Y_{i,j}. Thus, the
problem of computing the query MAX-LENGTH(i, j) is reduced to finding the LCA Y_{i,j}. We
use the algorithm of [HT] for the purpose of computing all the queries that arise throughout
the text analysis.
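
The reduction from a MAX-LENGTH query to an LCA query can be sketched as follows, under two simplifying assumptions that are not part of the algorithm above. First, the tree is an uncompacted trie of the suffixes (one character per edge), which has O(m^2) nodes but the same LCA depths as the suffixes tree. Second, the LCA is found by climbing father pointers, which is not the constant-time algorithm of [HT]. The names Node, suffix_leaves, lca, max_length and direct are hypothetical, indices are 0-based (so leaves[i] stands for the suffix a_{i+1}, ..., a_m), and the pattern is assumed to end in a sentinel that occurs nowhere else.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent    # father pointer (None only for the root)
        self.depth = 0 if parent is None else parent.depth + 1
        self.children = {}      # character -> child node


def suffix_leaves(p):
    """Uncompacted trie of all suffixes of p; returns the leaf of each suffix p[s:]."""
    root = Node()
    leaves = []
    for s in range(len(p)):
        node = root
        for c in p[s:]:
            if c not in node.children:
                node.children[c] = Node(node)
            node = node.children[c]
        leaves.append(node)
    return leaves


def lca(u, v):
    """Naive LCA: repeatedly lift the node that is at least as deep to its father."""
    while u is not v:
        if u.depth >= v.depth:
            u = u.parent
        else:
            v = v.parent
    return u


def max_length(leaves, i, j):
    """MAX-LENGTH(i, j): depth of the LCA of the leaves of p[i:] and p[j:]."""
    return lca(leaves[i], leaves[j]).depth


def direct(p, i, j):
    """Reference answer by comparing characters one at a time."""
    f = 0
    while i + f < len(p) and j + f < len(p) and p[i + f] == p[j + f]:
        f += 1
    return f


pattern = "abab$"
leaves = suffix_leaves(pattern)
for i in range(len(pattern)):
    for j in range(i + 1, len(pattern)):
        assert max_length(leaves, i, j) == direct(pattern, i, j)
print(max_length(leaves, 0, 2))  # abab$ and ab$ share the prefix ab
```

The sentinel matters here: without it, a suffix that is a prefix of another suffix would end at an internal node rather than a leaf.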

Complexity. Each query of the form MAX-LENGTH(i, j) requires O(1) time in the
first implementation of the algorithm. Since the analysis of the text took O(k^2 n) time there,
only O(k^2 n) such queries may arise throughout the text analysis. Using the classification of
[HT], we are interested in the static lowest common ancestors problem, where the tree is static
but the queries are given on line. That is, each query must be answered before the next one
is known. The tree which is the output of the pattern analysis has O(m) nodes. [HT] gave an
algorithm for the problem which requires O(α + β) time and O(β) space, where α is the
number of queries and β is the number of nodes in the tree. Applying their algorithm clearly
yields O(k^2 n) time. Given MAX-LENGTH, the additional space requirement of the first
implementation is O(k). Therefore, we need O(m + k) = O(m) space for the text analysis of
the second implementation.
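
Putting the pieces together, the bounds stated in the abstract follow by substitution. The sketch below writes α for the number of queries and β for the number of tree nodes in the [HT] bound, and uses x ≤ m for the alphabet size of the pattern; the names T_queries, T_total and Space are labels introduced only for this summary.

```latex
\begin{align*}
\alpha &= O(k^2 n), \qquad \beta = O(m), \\
T_{\text{queries}} &= O(\alpha + \beta) = O(k^2 n + m), \\
T_{\text{total}} &= \underbrace{O(m)}_{\text{pattern analysis, fixed alphabet}} + O(k^2 n + m) = O(m + k^2 n), \\
T_{\text{total}} &= \underbrace{O(m \log x)}_{\text{pattern analysis, } x \text{ letters}} + O(k^2 n + m) = O(m \log m + k^2 n), \\
\text{Space} &= O(\beta) + O(k) = O(m + k) = O(m).
\end{align*}
```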

Acknowledgement.  We  are  grateful  to  Ed  Schonberg  for  his  comment  that  simplified 
considerably  the  pattern  analysis  part  of  the  first  implementation  of  the  algorithm. 